A simple speech-pattern recognizer that uses MFCCs (Mel-Frequency Cepstral Coefficients) and Dynamic Time Warping (DTW) to match a given template speech set against a test set.
Required additional library to run this notebook:
Librosa
@edwardpassagi on GitHub
# Library Import and basic function definition
from scipy.io import wavfile
import random
import scipy
import scipy.spatial.distance as dis
import scipy.signal as signal
import numpy as np
import IPython.display as ipd
import librosa
# Progress Visualization
from tqdm.auto import tqdm, trange
# Ignore MFCC warning due to wavfile tag
import warnings
warnings.filterwarnings('ignore')
# Print Sound
def sound(x, rate=8000, label=''):
    from IPython.display import display, Audio, HTML
    if label == '':
        display(Audio(x, rate=rate))
    else:
        display(HTML(
            '<style> table, th, td {border: 0px; }</style> <table><tr><td>' + label +
            '</td><td>' + Audio(x, rate=rate)._repr_html_()[3:] + '</td></tr></table>'
        ))
Since I'll be comparing MFCC data for each audio clip (template and test), we need both the WAV data and the MFCC representation of each file.
# sr = 44100
sr = wavfile.read("./digits_samples/template.wav")[0]
# take L channel
template = np.array(wavfile.read("./digits_samples/template.wav")[1][:,0], dtype=float)
test = np.array(wavfile.read("./digits_samples/test.wav")[1][:,0], dtype=float)
# find MFCC for both sets
templateMFCC = librosa.feature.mfcc(y=template, sr=sr, n_mfcc=50)
testMFCC = librosa.feature.mfcc(y=test, sr=sr, n_mfcc=50)
# parse template to 10 MFCC and 10 digits
tempMFList = []
tempDigs = np.array(np.array_split(template,10))
# parse testing to 110 MFCCs and 110 digits
testMFList = []
testDigs = np.array(np.array_split(test,110))
for i in range(10):
    tempMFList.append(librosa.feature.mfcc(y=tempDigs[i], sr=sr, n_mfcc=50))
for i in range(110):
    testMFList.append(librosa.feature.mfcc(y=testDigs[i], sr=sr, n_mfcc=50))
# sound of some template digits
print("Template digits")
for i in range(0,10,2):
pr = "number: "+str(i)
sound(tempDigs[i], sr, pr)
# sound of some test digits
print("Test digits")
for i in range(90,100,2):
pr = "number: "+str(i)
sound(testDigs[i], sr, pr)
Since each digit (or any utterance) can be spoken in vastly different ways, we want to ignore small, irrelevant differences that do not change the meaning of the speech.
Thus, we can approach this problem by comparing the MFCC slices of the test and template signals, finding the lowest-cost alignment path to determine our predicted result.
In this algorithm, we'll use Bellman's optimality principle for the pathfinding step.
First, we need to compute the distances between frames of our representative matrices (in my case, the MFCC form). The underlying cosine similarity is:
$$D(\mathbf{a},\mathbf{b}) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}$$
where D(i,j) is the distance between the i-th frame of the template and the j-th frame of the input. (`scipy.spatial.distance.cosine`, used below, returns one minus this similarity, so identical frames have distance 0.)
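As a quick numeric sanity check (with made-up frame vectors, not the notebook's MFCC data), the similarity above relates to `scipy.spatial.distance.cosine` as follows:

```python
import numpy as np
from scipy.spatial import distance as dis

# two hypothetical MFCC frames
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])

# cosine similarity from the formula above
sim = np.sum(a * b) / (np.sqrt(np.sum(a**2)) * np.sqrt(np.sum(b**2)))

# scipy's cosine *distance* is 1 minus this similarity
assert np.isclose(dis.cosine(a, b), 1.0 - sim)
```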
def D_mat(a,b):
D = np.zeros((len(a.T), len(b.T)))
for i, matA in enumerate(a.T):
for j, matB in enumerate(b.T):
# get cosine distance between the two frames
D[i,j] = dis.cosine(matA,matB)
return D
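As an aside, the nested loops above can be vectorized with `scipy.spatial.distance.cdist`, which computes all pairwise cosine distances at once; a small check on random stand-in matrices (50 coefficients by N frames, mimicking the MFCC shape):

```python
import numpy as np
from scipy.spatial.distance import cdist
import scipy.spatial.distance as dis

rng = np.random.default_rng(0)
a = rng.normal(size=(50, 8))   # stand-in MFCC matrices: 50 coefficients x N frames
b = rng.normal(size=(50, 6))

# cdist over the transposed matrices gives every pairwise cosine distance at once
D_fast = cdist(a.T, b.T, metric='cosine')

# matches the nested-loop construction entry by entry
D_slow = np.array([[dis.cosine(fa, fb) for fb in b.T] for fa in a.T])
assert D_fast.shape == (8, 6)
assert np.allclose(D_fast, D_slow)
```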
Second, we can compute the cost matrix C, where C(i,j) is the cost of the optimal path reaching (i,j). As a constraint, we only consider paths arriving from (i-1,j), (i-1,j-1), and (i,j-1).
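Written out, this constraint gives the recurrence:
$$C(i,j) = D(i,j) + \min\bigl\{\,C(i-1,j),\; C(i-1,j-1),\; C(i,j-1)\,\bigr\}$$
with the first row and column left as the raw distances D, so an alignment path may begin anywhere along either edge.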
def C_mat(D):
C = D.copy()
for i in range(1, C.shape[0]):
for j in range(1, C.shape[1]):
curr = C[i,j]
W = C[i, j-1] + curr
NW = C[i-1,j-1] + curr
N = C[i-1,j] + curr
# assign lowest value to C matrix
C[i,j] = np.nanmin([W,NW,N])
return C
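A tiny worked example of the cost matrix (with `C_mat` restated so this snippet runs on its own): for a distance matrix with zeros on the diagonal, the diagonal alignment accumulates no cost.

```python
import numpy as np

# C_mat restated so this example is self-contained
def C_mat(D):
    C = D.copy()
    for i in range(1, C.shape[0]):
        for j in range(1, C.shape[1]):
            curr = C[i, j]
            C[i, j] = np.nanmin([C[i, j-1] + curr, C[i-1, j-1] + curr, C[i-1, j] + curr])
    return C

D = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
C = C_mat(D)
# the diagonal path (0,0) -> (1,1) -> (2,2) accumulates only zeros
assert C[2, 2] == 0.0
```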
Finally, we can form our classifier by using the lowest cost path to determine our prediction.
def classify(inputMFCC, templateMFCC, window = 40):
retval = np.zeros(len(templateMFCC))
for i, templateFrame in enumerate(templateMFCC):
D = D_mat(inputMFCC, templateFrame)
C = C_mat(D)
# get minimum cost from both edges
# only consider the last half
opt = min(min(C[window:,-1]),min(C[-1,window:]))
retval[i]=opt
# return the minimum index
return np.argmin(retval)
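To see the classifier end to end on data we fully control, here is a compact, self-contained restatement of the D/C/classify pipeline run on synthetic template matrices (not the notebook's recordings); a lightly perturbed copy of one template should match that template:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Minimal restatement of the pipeline so this demo runs on its own
def classify_min(inputMFCC, templates, window=2):
    costs = []
    for t in templates:
        D = cdist(inputMFCC.T, t.T, metric='cosine')
        C = D.copy()
        for i in range(1, C.shape[0]):
            for j in range(1, C.shape[1]):
                C[i, j] = D[i, j] + np.nanmin([C[i, j-1], C[i-1, j-1], C[i-1, j]])
        # minimum cost at the ends of both edges, skipping the first `window` rows/cols
        costs.append(min(C[window:, -1].min(), C[-1, window:].min()))
    return int(np.argmin(costs))

rng = np.random.default_rng(1)
templates = [rng.normal(size=(5, 6)) for _ in range(3)]
# a lightly perturbed copy of template 2 should be classified as template 2
test = templates[2] + 0.001 * rng.normal(size=(5, 6))
assert classify_min(test, templates) == 2
```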
We are now done with our algorithm and can test it on our test sets.
Remember that the actual digit (0-9) for a given test index is testIndex mod 10.
# Test on digit 3
testIndex = 93
predicted = classify(testMFList[testIndex], tempMFList)
print("Actual: {}, Predicted: {}".format(testIndex%10,predicted))
sound(testDigs[testIndex], sr, "Actual digit")
sound(tempDigs[predicted], sr, "Template digit")
It seems that our algorithm works just fine. To determine its accuracy, we can run it on many different test samples.
I have 110 recorded digits to test it against.
correct = np.zeros(10)
for i in trange(110, desc='Test set'):
    guessedval = classify(testMFList[i], tempMFList)
    actualIndex = i % 10
    # tally a correct guess for this digit
    if guessedval == actualIndex:
        correct[actualIndex] += 1
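The per-digit tally above can also be written in one shot with `np.bincount`; a small illustration with made-up predictions rather than the real classifier output:

```python
import numpy as np

# hypothetical guesses for test items 0..9 (item 4 is misclassified as 3)
preds = np.array([0, 1, 2, 3, 3, 5, 6, 7, 8, 9])
actual = np.arange(10) % 10

# count correct guesses per digit without an explicit loop
correct = np.bincount(actual[preds == actual], minlength=10)
assert correct.sum() == 9
assert correct[4] == 0
```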
We can now see our accuracy for each digit:
## Data Summary
print("Data Summary:\n")
totalCorrectDigit = int(np.sum(correct))
print("Total Accuracy: {}%, Correct Guesses: {}, False Guesses: {}\n".format(totalCorrectDigit/110*100, totalCorrectDigit, 110-totalCorrectDigit))
for idx, c in enumerate(correct):
print("Digit {} Accuracy: {}%".format(idx, c/11*100) )
We can now apply what we have built to something usable in daily life.
In this case, we build a voice-driven dialler: we set up a number for each of our friends (by saying the friend's name followed by their phone number), and can then call them just by saying their name.
Let's start by importing our audio files, and parsing them (similar to step 2).
# sr = 44100
sr = wavfile.read("./voice_dialler/input.wav")[0]
# take L channel
tempVD = np.array(wavfile.read("./voice_dialler/input.wav")[1][:,0], dtype=float)
testVD = np.array(wavfile.read("./voice_dialler/names.wav")[1][:,0], dtype=float)
print("Input data:")
sound(tempVD, sr, "Setup audio")
sound(testVD, sr, "Test names")
| Names | Phone Number |
|---|---|
| Furkan | 1379 |
| Simon | 5240 |
| Mohamed | 6683 |
| Edward | 7134 |
| Amir | 9523 |
recipientNum = 5
phoneDigitsAmt = 4
names = ["Furkan", "Simon","Mohamed","Edward","Amir"]
# parse setup audio into 10 chunks (5 names and 5 phone numbers, alternating)
tempWAV = np.array(np.array_split(tempVD, 10))
# parse test audio into 10 name chunks and compute their MFCCs
testMFListVD = []
testWAV = np.array(np.array_split(testVD, 10))
for i in range(10):
    testMFListVD.append(librosa.feature.mfcc(y=testWAV[i], sr=sr, n_mfcc=50))
# sound of the template chunks
print("template chunks")
for i in range(10):
pr = "chunk: "+str(i)
sound(tempWAV[i], sr, pr)
# sound of the test name chunks
print("test names")
for i in range(10):
pr = "chunk: "+str(i)
sound(testWAV[i], sr, pr)
We can now convert the phone numbers from WAV audio into strings.
tempNames = []
phoneNumber = []
for i in range(10):
    if i % 2 == 0:
        # name chunk: keep only the first quarter, where the name is spoken
        tempNames.append(np.array_split(tempWAV[i], 4)[0])
    else:
        # phone-number chunk
        phoneNumber.append(tempWAV[i])
# Find each audio files' MFCC representation using librosa
tempNamesMF = []
for i in range(recipientNum):
    tempNamesMF.append(librosa.feature.mfcc(y=tempNames[i], sr=sr, n_mfcc=50))
phoneDigs = []
for i in range(recipientNum):
phoneDigs.append(np.array_split(phoneNumber[i],phoneDigitsAmt))
# Convert phone number to string
phoneNumArr = np.zeros((recipientNum,phoneDigitsAmt))
for i in trange(recipientNum, desc='recipients'):
    for j in tqdm(range(phoneDigitsAmt), desc='digits'):
        # classify each spoken phone-number digit against the digit templates
        curDigMFCC = librosa.feature.mfcc(y=phoneDigs[i][j], sr=sr, n_mfcc=50)
        curDigit = classify(curDigMFCC, tempMFList)
        phoneNumArr[i][j] = curDigit
phoneNumStr = []
for i in range(recipientNum):
curStr = ""
for j in range(phoneDigitsAmt):
curStr += str(int(phoneNumArr[i][j]))
phoneNumStr.append(curStr)
print(phoneNumStr)
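The digit-to-string conversion above can be written more compactly with `str.join`; shown here on a made-up digit array rather than the classifier's output:

```python
import numpy as np

phoneNumArr = np.array([[1., 3., 7., 9.],
                        [5., 2., 4., 0.]])   # hypothetical classified digits
phoneNumStr = [''.join(str(int(d)) for d in row) for row in phoneNumArr]
assert phoneNumStr == ['1379', '5240']
```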
Accuracy for our digit recognizer here is ~90%. We can see that at index 2, Mohamed's phone number is detected as 2783, even though it's supposed to be 6683.
We can now test the feature to see if our algorithm correctly calls the spoken name.
# Testing to call "Mohamed", phone number 6683
testIdx = 2
title = "input: "+ names[testIdx%5]
sound(testWAV[testIdx%5], sr, title)
guessedNameIdx = classify(testMFListVD[testIdx], tempNamesMF)
print("matches with:")
title = "template: "+ names[guessedNameIdx]
sound(tempNames[guessedNameIdx], sr, title)
print("Dialling {}, with phone number: {}".format(names[guessedNameIdx], phoneNumStr[guessedNameIdx]))
# Testing to call "Furkan", phone number 1379
testIdx = 0
title = "input: "+ names[testIdx%5]
sound(testWAV[testIdx%5], sr, title)
guessedNameIdx = classify(testMFListVD[testIdx], tempNamesMF)
print("matches with:")
title = "template: "+ names[guessedNameIdx]
sound(tempNames[guessedNameIdx], sr, title)
print("Dialling {}, with phone number: {}".format(names[guessedNameIdx], phoneNumStr[guessedNameIdx]))
Now that we have tested the feature, we can determine the overall accuracy of the algorithm with some test sets.
In my case, I have 10 recordings covering the 5 given names (2 for each).
The actual name index for a given test index is testIndex mod 5.
correctVD = np.zeros(5)
for i in trange(10, desc='Test Set'):
    guessedNameIdx = classify(testMFListVD[i], tempNamesMF)
    actualIndex = i % 5
    # tally a correct guess for this name
    if guessedNameIdx == actualIndex:
        correctVD[actualIndex] += 1
The summary of the accuracy for each case is listed below:
## Data Summary
totalVDCorrect = int(np.sum(correctVD))
print("Data Summary:\n")
print("Accuracy: {}%, Correct Guesses: {}, False Guesses: {}\n".format(totalVDCorrect/10*100, totalVDCorrect, 10-totalVDCorrect))
for idx, c in enumerate(correctVD):
print("Name {} Accuracy: {}%".format(names[idx%5], c/2*100) )
Based on our small test set, we can confirm that our voice-driven dialler works as expected.
In conclusion, the algorithm used is certainly not optimized for large template sets (it iterates over every template and over every pair of MFCC frames to compute distances), so runtime grows quickly on bigger datasets.
The high accuracy may also be inflated by test sets that are fairly similar to the templates (I recorded both under similar conditions).